xxxxxxxxxx# Data Analysis - Velib Project in [Python](https://www.python.org/) <a href="https://www.python.org/"><img src="https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/community/logos/python-logo-only.png" style="max-width: 35px; display: inline" alt="Python"/></a> ---_Authors:_ Amine Aziz Alaoui (<small>IRT St-Exupéry</small>), J. Chevallier (<small>INSA Toulouse</small>), J. Guérin (<small>ANITI</small>), Franck Kouassi (<small>INSA Toulouse</small>), O. Roustant (<small>INSA Toulouse</small>).xxxxxxxxxxWe consider the [velib](https://www.velib-metropole.fr/donnees-open-data-gbfs-du-service-velib-metropole) data set, related to the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour, from the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply loading, is defined as the ratio of number of available bikes divided by the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. all bikes are available. A loading of 0 means that the station is empty, all bikes have been rent.From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). **The aim is to detect clusters in the data, corresponding to common customer usages.** This clustering should then be used to predict the loading profile.---The aim of this tutorial is to provide you a _starting point for your project_. Unsurprisingly, the first step is to get to grips with the dataset by exploring it through easy routines: - How are the data coded? - How many stations are observed? - What is the dispersion of the data? - _etc._You will find some suggested solutions in the "solutions" folder (we can certainly do better). _I can only urge you to first try to answer the questions yourself_, making sure you know which graph to use to answer the question, and then to look in the Python documentation to find out how to make a particular graph (there are lots of resources on the Internet for Python!). The counterpart to this tutorial, but in [R](https://plmlab.math.cnrs.fr/wikistat/Exploration/-/blob/master/Velib/TP_velib_R.ipynb), is also available on wikistat.We consider the velib data set, related to the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour, from the period Monday 2nd Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply loading, is defined as the ratio of number of available bikes divided by the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. all bikes are available. A loading of 0 means that the station is empty, all bikes have been rent.
From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). The aim is to detect clusters in the data, corresponding to common customer usages. This clustering should then be used to predict the loading profile.
The aim of this tutorial is to provide you a starting point for your project. Unsurprisingly, the first step is to get to grips with the dataset by exploring it through easy routines:
You will find some suggested solutions in the "solutions" folder (we can certainly do better). I can only urge you to first try to answer the questions yourself, making sure you know which graph to use to answer the question, and then to look in the Python documentation to find out how to make a particular graph (there are lots of resources on the Internet for Python!). The counterpart to this tutorial, but in R, is also available on wikistat.
import pandas as pdimport numpy as npimport random as rdimport matplotlib.pyplot as pltimport seaborn as snssns.set_style('darkgrid')%matplotlib inlinexxxxxxxxxx## Preliminary: Load Data and Quality Assessmentxxxxxxxxxx##### <span style="color:purple"> **Todo:** Load the data</span>- `velibLoading.csv` file.- `velibCoord.csv` file- Check that loading has gone smoothly by looking at the first lines of the notebooks.velibLoading.csv file.velibCoord.csv filex
### TO BE COMPLETED ### loading = pd.read_csv('data/velibLoading.csv')# %load solutions/Python/load_loading.pypath = 'data/' # If data in 'data' directoryloading = pd.read_csv(path + 'velibLoading.csv', sep = " ")loadingxxxxxxxxxx### TO BE COMPLETED ### coord = np.pd.read_csv('data/velibCoord.csv')# %load solutions/Python/load_coord.pycoord = pd.read_csv(path + 'velibCoord.csv', sep = " ")coord.head()xxxxxxxxxx##### <span style="color:purple"> **Question:** Do these data sets contain missing data?</span>x
### TO BE COMPLETED ### print(coord.isnull().sum())#aucune donnée manquante = aucune donnée non renseignéeprint(loading.isnull().sum().sum())#la premiere sum regroupe les valeurs de chaque colonne et la deuxieme regroupe toutes les valeurs du tableaux
# %load solutions/Python/missing_value.pyloading_missing_value = loading.isna().sum().sort_values(ascending=False)print('--- Loading ---')print(loading_missing_value.sum())# --- #print('')coord_missing_value = coord.isna().sum().sort_values(ascending=False)print('--- Coord ---')print(coord_missing_value)xxxxxxxxxx##### <span style="color:purple"> **Question:** Do these data sets duplicate data?</span>### TO BE COMPLETED ### print('--- Loading ---')print(loading.duplicated().sum())# --- #print('')print('--- Coord ---')print(coord.duplicated().sum())# %load solutions/Python/duplicated.pyprint('--- Loading ---')print(loading.duplicated().sum())# --- #print('')print('--- Coord ---')print(coord.duplicated().sum())xxxxxxxxxx##### <span style="color:purple"> **Question:** Are any stations present more than once in the data set?</span>- You can use the [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) function to count the number of occurrences of each function name.- Discuss this result in the light of the previous question. If the answer is yes, we could, for example, try to visualize the different entries for the same station.value_counts() function to count the number of occurrences of each function name.### TO BE COMPLETED ### # Stations in descending order of occurrencestation_name = coord['names']print(station_name)station_name.value_counts()# %load solutions/Python/station_name.py# Stations in descending order of occurrencestation_names = coord.names.value_counts().sort_values(ascending=False)print(station_names)# --- #print('')# We display the station with the most occurrences, i.e. the station corresponding to the first line of 'station_name'.name = station_names.index[0]coord[coord.names == name]xxxxxxxxxx##### <span style="color:purple"> **Todo:** Plot the loading a station</span>- Plot the load evolution of the $i$-th station over time;- Draw a vertical line to delimit the days (_**Hint:** How many days do we observe?_);- Enter the station name in the figure title;- Label the axes in the figure.x
### TO BE COMPLETED ### i = rd.randrange(loading.shape[0])loading_data = loading.to_numpy()n_steps = loading.shape[1] # number of observed time stepstime = np.linspace(1,n_steps,n_steps) # observed time rangetime_tick = np.linspace(1, n_steps, 8) # beginning of days# --- #plt.figure(figsize = (20, 6))plt.plot(time,loading_data[i, :])plt.vlines(x = time_tick, ymin = 0, ymax = 1, colors = "orange", linestyle = "dotted", linewidth = 3)plt.xlabel('Time', fontsize = 20)plt.ylabel('Loading', fontsize = 20)plt.title(coord.names[1 + i], fontsize = 25)plt.show()# %load solutions/Python/plot_loading.pyi = rd.randrange(loading.shape[0])loading_data = loading.to_numpy()n_steps = loading.shape[1] # number of observed time stepstime_range = np.linspace(1, n_steps, n_steps) # observed time rangetime_tick = np.linspace(1, n_steps, 8) # beginning of days# --- #plt.figure(figsize = (20, 6))plt.plot(time_range, loading_data[i, :], linewidth = 2, color = 'purple')plt.vlines(x = time_tick, ymin = 0, ymax = 1, colors = "orange", linestyle = "dotted", linewidth = 3)plt.xlabel('Time', fontsize = 20)plt.ylabel('Loading', fontsize = 20)plt.title(coord.names[1 + i], fontsize = 25)plt.xticks(fontsize = 15)plt.yticks(fontsize = 15)plt.tight_layout()plt.show()xxxxxxxxxx> 0 veut dire que tous les vélos sont pris 0 veut dire que tous les vélos sont pris
xxxxxxxxxx##### <span style="color:purple"> **Question:** Does loading differ from one station to another?</span> Draw a matrix of plots of size 4*4 corresponding to 16 stations of your choice. _Do not forget the vertical lines corresponding to days_Draw a matrix of plots of size 4*4 corresponding to 16 stations of your choice. Do not forget the vertical lines corresponding to days
### TO BE COMPLETED ### fig,axs =plt.subplots(4,4,figsize = (15,12))for i in range(4): for j in range(4): k=stations[2 * i + j] axs[i,j].plot(time,loading_data[k, :]) axs[i,j].vlines(x = time_tick, ymin = 0, ymax = 1, colors = "orange", linestyle = "dotted", linewidth = 3) axs[i, j].set_title(coord.names[1 + k], fontsize = 12)x
# %load solutions/Python/plot_loading_16.py# We select 16 stations at randomstations = np.arange(loading.shape[0])rd.shuffle(stations)stations = stations[:16] # --- #fig, axs = plt.subplots(4, 4, figsize = (15,12))for i in range(4): for j in range(4): k_station = stations[4 * i + j] axs[i, j].plot(time_range, loading_data[k_station, :], linewidth = 1, color = 'purple') axs[i, j].vlines(x = time_tick, ymin = 0, ymax = 1, colors = "orange", linestyle = "dotted", linewidth = 3) axs[i, j].set_title(coord.names[1 + k_station], fontsize = 12)for ax in axs.flat: ax.set_xlabel('Time', fontsize = 12) ax.set_ylabel('Loading', fontsize = 12) ax.tick_params(axis='x', labelsize=10) ax.tick_params(axis='y', labelsize=10) plt.tight_layout()plt.show()xxxxxxxxxx> Comments?Comments?
xxxxxxxxxx##### <span style="color:purple"> **Todo:** Draw the boxplot of the variables, sorted in time order.</span>1. What can you say about the distribution of the variables? 2. Position, dispersion, symmetry? 3. Can you see a difference between days?_Hint:_ To change the graphical properties of boxplots (for example, the thickness of the median), use the [`patch_artist = True`](https://python-charts.com/distribution/box-plot-matplotlib/) argument in the `plt.boxplot` function.Hint: To change the graphical properties of boxplots (for example, the thickness of the median), use the patch_artist = True argument in the plt.boxplot function.
### TO BE COMPLETED ### bp=plt.boxplot(loading, patch_artist = True)x
# %load solutions/Python/plot_loading_disp.pyplt.figure(figsize = (20,6))# --- #bp = plt.boxplot(loading_data, widths = 0.75, patch_artist = True)for box in bp['boxes']: box.set_alpha(0.8) for median in bp['medians']: median.set(color = "Purple", linewidth=5) # --- # plt.vlines(x = time_tick, ymin = 0, ymax = 1, colors = "Orange", linestyle = "dotted", linewidth = 5)# --- #plt.xlabel('Time', fontsize = 20)plt.ylabel('Loading', fontsize = 20)plt.title("Boxplots", fontsize = 25)plt.xticks(ticks = np.arange(0, 168, 5), labels=np.arange(0, 168, 5), fontsize = 15)plt.yticks(fontsize = 15)plt.tight_layout()plt.show()xxxxxxxxxx> Comments?On voit le profil de chargement au cours de la semaine (jour par jour)Comments? On voit le profil de chargement au cours de la semaine (jour par jour)
xxxxxxxxxx##### <span style="color:purple"> **Question:** What is the average station fill rate?</span>Which station is, on average, the fullest? the least full?Which station is, on average, the fullest? the least full?
x
### TO BE COMPLETED ### print('--- Average fill rate ---')print(loading.mean().mean())#On veut la moyenne de toutes les stations à toutes les heures# Pour avoir la moyenne des colonnes pour une ligne en particulier (une station) : print(loading.mean(axis=1)) # --- #print('')print('--- Least crowded station, on average ---')mean=loading.mean(axis=1)i = mean.idxmin()print('Average fill rate :',mean[i])print(coord.loc[i])# --- #print('')print('--- Fullest station, on average ---')mean=loading.mean(axis=1)i = mean.idxmax()print('Average fill rate :',mean[i])print(coord.loc[i])[...]x
# %load solutions/Python/loading_mean.pyprint('--- Average fill rate ---')print(loading.mean().mean())# --- #print('')loading_mean = pd.Series(loading.mean(axis=1))print('--- Least crowded station, on average ---')i = loading_mean.idxmin()print('Average fill rate :',loading_mean[i])print(coord.loc[i])# --- #print('')print('--- Fullest station, on average ---')i = pd.Series(loading.mean(axis=1)).idxmax()print('Average fill rate :',loading.mean(axis=1)[i])print(coord.loc[i])xxxxxxxxxx##### <span style="color:purple"> **Question:** Does the average load vary from one station to another?</span>- Show the evolution of the average load for each station. - On the same graph, plot the average loading for the entire data set.### TO BE COMPLETED ### plt.plot(mean)plt.hlines(y=loading.mean().mean(),xmin=0,xmax=1189,colors='orange',linewidth = 5)# %load solutions/Python/plot_mean_stations.pyn_stations = loading.shape[0] # number of observed stationsstations = np.arange(n_stations)plt.figure(figsize = (20,6))# --- #plt.plot(loading_mean)plt.hlines(y = loading.mean().mean(), xmin=0, xmax=n_stations, colors = "OrangeRed", linewidth = 3)# --- #plt.xlabel('Stations', fontsize = 20)plt.ylabel('Average loading', fontsize = 20)plt.title("Average loading per station", fontsize = 25)plt.xticks(fontsize = 15)plt.yticks(fontsize = 15)plt.tight_layout()plt.show()xxxxxxxxxx> Comments?Comments?
xxxxxxxxxx##### <span style="color:purple"> **Question:** Does the average load vary over the course of a day?</span>Plot the average hourly loading for each day (on a single graph).Plot the average hourly loading for each day (on a single graph).
### TO BE COMPLETED ### mean_per_hour_per_day = loading.mean(axis = 0).to_numpy()mean_per_hour_per_day = mean_per_hour_per_day.reshape((7, 24))days = ["Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday", "Sunday"]plt.figure(figsize = (15,10))plt.plot(mean_per_hour_per_day.transpose())plt.plot(mean_per_hour, color = "black", linewidth = 3)# %load solutions/Python/plot_mean_hours.pymean_per_hour_per_day = loading.mean(axis = 0).to_numpy()mean_per_hour_per_day = mean_per_hour_per_day.reshape((7, 24))mean_per_hour = mean_per_hour_per_day.mean(axis=0)# --- #days = ["Monday", "Tuesday", "Wednesday","Thursday", "Friday", "Saturday", "Sunday"]plt.figure(figsize = (15,10))plt.plot(mean_per_hour_per_day.transpose())plt.plot(mean_per_hour, color = "black", linewidth = 3)plt.xlabel('Hourly loading, averaged over all stations', fontsize = 20)plt.ylabel('Loading', fontsize = 20)plt.legend(days + ['Weekly'])plt.xticks(ticks = np.arange(0,24,4), labels=np.arange(0,24,4), fontsize = 15) plt.tight_layoutplt.show()xxxxxxxxxx> Comments?Comments?
import matplotlib.cm as cmimport matplotlib.patches as mpatchesimport plotly.express as pxxxxxxxxxxx##### <span style="color:purple"> **Question:** Where are the velib stations located?</span>- Plot the stations coordinates on a 2D map (latitude _vs._ longitude)- Use the average hourly loading as a color scale- You can consider different times of day, for example 6am, 12pm, 11pm on Monday, or the average weekly load at 6am.- You can consider different days at the same time, or the average load for each day.- You can use the [`scatter_mapbox`](https://plotly.com/python/scattermapbox/) function of the [`plotly.express`](https://plotly.com/python/plotly-express/) to charge the map of Parisscatter_mapbox function of the plotly.express to charge the map of Paris### TO BE COMPLETED ### ## Simple 2D representation# Monday at hour 6h, 12h, 23h# Hours to be displayedhours = ...# --- #[...]# %load solutions/Python/plot_loading_2D_monday.py## Simple 2D representation# Monday at 6h, 12h and 23hhours = [6, 12, 23]# --- #s, n = 10, len(hours)fig, axs = plt.subplots(1, n, figsize = (s*n, s))for (i,h) in enumerate(hours): im = axs[i].scatter(coord.latitude, coord.longitude, c = loading_data[:,h], cmap = cm.plasma_r) axs[i].set_title('Stations loading - Monday {} h'.format(h), fontsize = 25) plt.colorbar(im, ax=axs[i]) for ax in axs.flat: ax.set_xlabel('Latitude', fontsize = 20) ax.set_ylabel('Longitude', fontsize = 20) ax.tick_params(axis='x', labelsize=15) ax.tick_params(axis='y', labelsize=15)plt.tight_layout()plt.show()xxxxxxxxxx> Comments?Comments?
xxxxxxxxxx### TO BE COMPLETED ### ## Simple 2D representation# Loading at 6pm, depending on the day of the week[...]# %load solutions/Python/plot_loading_2D_18h.py## Simple 2D representation# Loading at 6pm, depending on the day of the weekh = 18hours = np.arange(h, 168, 24)load_per_hour = loading_data[:, hours]days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]# --- #s, m = 10, 3k = 1 + len(days)//mfig = plt.figure(figsize=(s+1, s))fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=.3, wspace=.25)for (i,d) in enumerate(days): ax = fig.add_subplot(k, m, i+1) im = ax.scatter(coord.latitude, coord.longitude, c = load_per_hour[:,i], cmap = cm.plasma_r) plt.colorbar(im) ax.set_title('Stations loading - ' + d + ' {} h'.format(h)) ax.set_xlabel('Latitude') ax.set_ylabel('Longitude') ax.tick_params(axis='x') ax.tick_params(axis='y')#plt.tight_layout()plt.show()xxxxxxxxxx> Comments?Comments?
### TO BE COMPLETED ### ## Visualization on the Paris map[...]# %load solutions/Python/plot_loading_map.py## Visualization on the Paris map# Weekly average at 6 p.m.h = 18hours = np.arange(h, 168, 24)load_per_hour = loading_data[:, hours].mean(axis=1)# --- # fig = px.scatter_mapbox(coord, lat = 'latitude', lon = 'longitude', mapbox_style = "carto-positron", color = load_per_hour, color_continuous_scale = px.colors.sequential.Plasma_r, #size = load_per_hour, zoom = 10, opacity = .9, title = 'Stations loading - Weekly average at {} h'.format(h))fig.show()xxxxxxxxxx> Comments?Comments?
xxxxxxxxxx## Influence of Altitude Difference on Station Loadingxxxxxxxxxx##### <span style="color:purple"> **Question:** Does Paris have many hilltop stations?</span>- Compare the number of hilltop stations with the others.xxxxxxxxxxloading_hill = ...[...]xxxxxxxxxx# %load solutions/Python/hilltop_stations.pyxxxxxxxxxx##### <span style="color:purple"> **Question:** Are hilltop stations more crowded than others?</span>- Plot the stations coordinates on a 2D map (latitude _vs._ longitude), using a different color for stations which are located on a hill.- Redo the initial study, but distinguish hilltop stations from others.xxxxxxxxxx### TO BE COMPLETED ### ## Simple 2D representation[...]xxxxxxxxxx# %load solutions/Python/hilltop_stations_2D.pyxxxxxxxxxx### TO BE COMPLETED ### ## Visualization on the Paris mapcoord['hill'] = coord['bonus'].astype('category') # convert to categorical[...]xxxxxxxxxx# %load solutions/Python/hilltop_stations_map.py